ABSTRACT 

The Mouse Genome Database (MGD, http://www.informatics.jax.org) 
is the international community 
resource for integrated genetic, genomic and 
biological data about the laboratory mouse. Data 
in MGD are obtained through loads from major 
data providers and experimental consortia, elec- 
tronic submissions from laboratories and from the 
biomedical literature. MGD maintains a compre- 
hensive, unified, non-redundant catalog of mouse 
genome features generated by distilling gene 
predictions from NCBI, Ensembl and VEGA. MGD 
serves as the authoritative source for the nomencla- 
ture of mouse genes, mutations, alleles and strains. 
MGD is the primary source for evidence-supported 
functional annotations for mouse genes and gene 
products using the Gene Ontology (GO). MGD 
provides full annotation of phenotypes and human 
disease associations for mouse models (genotypes) 
using terms from the Mammalian Phenotype 
Ontology and disease names from the Online 
Mendelian Inheritance in Man (OMIM) resource. 
MGD is freely accessible online through our 
website, where users can browse and search 
interactively, access data in bulk using Batch 
Query or BioMart, download data files or use our 
web services Application Programming Interface 
(API). Improvements to MGD include expanded 
genome feature classifications, inclusion of new 
mutant allele sets and phenotype associations and 
extensions of GO to include new relationships and a 
new stream of annotations via phylogenetic-based 
approaches. 



INTRODUCTION 

The Mouse Genome Database (MGD) (1-3) serves as a 
primary resource for mammalian biologists, delivering a 
spectrum of genetic, genomic and biological data support- 
ing the use of mouse as a model for understanding human 
biology and disease. Central to its data offerings are 
the canonical mouse gene catalog, nucleotide and 
protein sequence associations, gene-to-function assign- 
ments based on the Gene Ontology (GO) (4), a compre- 
hensive catalog of mutant alleles, associations of mutant 
genotypes to their phenotype through the Mammalian 
Phenotype (MP) Ontology (5) and to the human diseases 
for which they are a model through curated associations 
to human diseases in Online Mendelian Inheritance in 
Man database (OMIM) (6). In addition, MGD provides 
a comprehensive genetic map, a genome browser (Mouse 
GBrowse) for genome viewing, Single Nucleotide 
Polymorphisms (SNPs) and other polymorphisms and 
mammalian orthology data. A summary of the current 
contents of MGD is given in Table 1. 

Integrated with MGD are other components of the 
Mouse Genome Informatics (MGI) database resource 
(http://www.informatics.jax.org). These include the Gene 
Expression Database (7), the Mouse Tumor Biology 
Database (8) and the MouseCyc database of metabolic 
pathways (9). Two additional resources tied to the main 
MGI resource are the International Mouse Strain 
Resource (IMSR) (10) and the Recombinase (cre) 
Portal (1). 

Data in MGD are obtained through data loads from 
major resource providers [e.g. sequence data from 
GenBank, gene models from NCBI, Ensembl, VEGA, 
mutant alleles from N-ethyl-N-nitrosourea 
(ENU)-mutagenesis groups and International Knockout 
Mouse Consortium (IKMC)], from electronic submissions 
from investigator laboratories, and from the biomedical 
Table 1. Summary of MGD data content (14 September 2011) 

Data type                                                       Count 
Genes with nucleotide sequence data                             [illegible in source] 
Genes with protein sequence data                                [illegible in source] 
Genes with mutant alleles in mice                               15 145 
Genes with one or more mutant alleles                           [illegible in source] 
Total mutant alleles (a)                                        738 414 
Number of cre-containing transgenes and knock-ins               1511 
Genes with mouse experiment-based functional (GO) annotations   13 524 
Mouse/human orthologs                                           17 847 
Mouse/rat orthologs                                             16 686 
Human diseases with one or more mouse models                    1121 
QTLs                                                            4670 
Number of references                                            169 700 
Number of reference SNPs                                        10 089 892 

(a) Mutant alleles include those occurring in mice and those existing only 
in mouse ES cell lines. Of the 738 414 total mutant alleles, 682 745 are 
gene traps in ES cell lines. 



literature. All data are attributed to the original source 
with access to references provided via PubMed where 
available. For data loads, quality control reports are 
generated that enumerate format and/or content 
anomalies and prioritize errors that need attention by 
curators. Standards for gene, allele and strain nomencla- 
ture, and for functional, phenotypic and human disease 
annotations using vocabularies and ontologies enable con- 
sistent annotations and robust data retrieval. 

MGD data can be accessed in many ways. A Quick 
Search box appears on all web pages and provides a 
ubiquitous, fast and simple entry for broad keyword or 
ID searches. More specialized query forms, accessible 
via the Search pull down on the navigation bar, allow 
multiparameter advanced searches, and the data content 
area icons on the homepage lead users to specific accesses 
to that data area. A vocabulary browser supports access 
to MGD content through ontology terms. A variety of 
regularly updated database reports can be accessed on 
the File Transfer Protocol (FTP) site. Programmatic 
access is provided through web services and through 
direct SQL access. 

KEY UPDATES AND CHANGES IN 2011 

Expanded classification terms for genome features 

New to MGD are feature type classifications as attributes 
of genome features. The feature types allow users to refine 
searches to include only specific classes of genome features 
(protein-coding genes, microRNAs, lincRNAs, 
Quantitative Trait Loci (QTL), transgenes, pseudogenes, 
etc.). Most of the classification terms and definitions are 
derived from the Sequence Ontology (SO) (11). We have 
also added new subclassification terms for genome 
features formerly grouped as pseudogenes. The overarch- 
ing term for these genome features is now pseudogenic 
region (SO: 0000462), defined as a non-functional 
feature descended from a gene or other functional 
feature. In MGD, three subcomponents are currently in use: 
pseudogene (a sequence that closely resembles a known functional gene, 
at another locus within a genome, which is non-functional 
as a consequence of mutations that prevent its 
transcription or translation); pseudogenic gene segment 
(a recombinational unit of a gene which, when 
incorporated by somatic recombination in the final gene 
transcript, results in a non-functional product); and 
polymorphic pseudogene (a pseudogene lacking function 
owing to a SNP or deletion/insertion, but in other 
individuals/haplotypes/strains the gene is translated). 
Where MGD, VEGA, Ensembl and 
National Center for Biotechnology Information (NCBI) 
disagree on the pseudogene subclassification type, a 
biotype conflict note is presented to the user on the 
MGD locus detail page. Where a genome feature is a 
non-functional pseudogene in some mouse strains, but 
functional in other mouse strains, a strain-specific note 
is presented on the detail page (Figure 1). 
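
The biotype conflict rule described above can be illustrated with a short sketch. This is not MGD's implementation; the function name and the dictionary layout are assumptions made for illustration. The example values are the Psme2b-ps assignments reported in Figure 1A.

```python
from collections import defaultdict


def biotype_conflict(assignments):
    """Group annotation providers by the biotype they assign.

    `assignments` maps a provider name to its biotype string.
    Returns (has_conflict, {biotype: [providers, sorted]}).
    Hypothetical helper, not part of any MGD software.
    """
    by_biotype = defaultdict(list)
    for provider, biotype in assignments.items():
        by_biotype[biotype].append(provider)
    grouped = {biotype: sorted(names) for biotype, names in by_biotype.items()}
    # A conflict exists when the providers do not all agree on one biotype.
    return len(grouped) > 1, grouped


# Assignments for Psme2b-ps as described in the Figure 1A caption.
psme2b_ps = {
    "MGI": "pseudogene",
    "NCBI": "pseudogene",
    "VEGA": "protein coding gene",
    "Ensembl": "protein coding gene",
}

conflict, groups = biotype_conflict(psme2b_ps)
print(conflict)
for biotype, providers in groups.items():
    print(biotype, "<-", ", ".join(providers))
```

Grouping by biotype, rather than returning only a yes/no flag, keeps the information a conflict note needs: which annotation groups stand behind each assignment.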

Nomenclature harmonization: T-cell receptor and 
immunoglobulin gene segments 

Working with the Immunogenetics Information System, 
IMGT/Gene-DB (12), MGD has expanded the number 
of defined T-cell receptor and immunoglobulin gene 
segments (a gene component region which acts as a 
recombinational unit of a gene whose functional form is 
generated through somatic recombination) to over 
670 and harmonized nomenclature for these important 
immunological gene segments. 

Mutant allele sets added 

The number of mutant alleles in MGD has increased by 
over 23 640 this year. This largely reflects ongoing devel- 
opment of genetically engineered and ENU-induced 
mutations by major mutagenesis programs, with signifi- 
cant contributions by individual investigators. Among 
the major additions of new mutant alleles to MGD 
were: 8364 new targeted mutations added from the 
IKMC (13), 870 new transgenes added from the Gene 
Expression Nervous System Atlas project (14), 492 new 
targeted and gene trap mutations from a Genentech/ 
Lexicon collaboration (15) and over 200 new ENU muta- 
tions from Dr Bruce Beutler's Mutagenetix program (16). 
Over 3000 new mutant alleles were developed from 
investigator-initiated experiments and added to MGD 
from biomedical literature curation or via investigator 
data submissions to MGD. The remaining approximately 
10000 new alleles are gene traps added via a data load 
from NCBI's Genome Survey Sequences Database (GSS) 
(17), most of which were generated by the IKMC. Of the 
current more than 596000 mutant alleles for mice, most 
were generated and only exist in Embryonic Stem (ES) cell 
lines, with approximately 30400 of these being either 
created or developed into living mice. 
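
As a quick arithmetic check, the itemized additions above can be summed and compared with the reported overall increase. The sketch below treats the figures given as "over" or "approximately" as round lower bounds, which is an assumption; the dictionary labels are shorthand, not official names.

```python
# Itemized additions as stated in the text. Entries marked "(over)" or
# "(approximately)" are inexact in the text and treated here as round
# lower bounds (an assumption made for this check).
additions = {
    "IKMC targeted mutations": 8364,
    "Gene Expression Nervous System Atlas transgenes": 870,
    "Genentech/Lexicon targeted and gene trap mutations": 492,
    "Mutagenetix ENU mutations (over)": 200,
    "Investigator-initiated alleles (over)": 3000,
    "GSS gene trap load (approximately)": 10000,
}

reported_increase = 23640  # "over 23 640 this year"

itemized_total = sum(additions.values())
unitemized = reported_increase - itemized_total

print("itemized total:", itemized_total)
print("not itemized:  ", unitemized)
```

The itemized figures sum to 22 926, leaving 714 alleles of the reported increase unaccounted for, which is consistent with several of the items being stated as lower bounds or approximations.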

The Quick Search tool now includes mutant alleles 

To take advantage of the large number of new mutant 
allele resources, MGD has improved the characteristics 
of its Quick Search tool, so it now returns the alleles, as 
well as other genome features, most closely associated 
with a query. (The previous implementation of the 
Quick Search returned genome features at the level of 
the gene.) This change helps users more easily locate 
Figure 1. Screenshots of the upper portion of two locus detail pages. (A) The BioType Conflict indicator (upper right), when opened, displays the 
different biotype annotations for Psme2b-ps. In this case, MGI and NCBI assign this marker as a pseudogene, whereas VEGA and Ensembl have 
assigned it the status of protein coding gene. Links are provided to the underlying evidence that supports the biotype assignments by different 
annotation groups. (B) The strain-specific marker indicator (upper right), when opened, displays information about strains in which the gene 
(in this case Ren2) is found, or not found, in the genome, with supporting reference links. 



relevant mouse model data from queries for phenotypes or 
disease. Given that the Quick Search accounts for 
>90% of the interactive MGD searches, we expect this 
change to have significant beneficial impact (Figure 2). 

Extensions to GO annotations 

GO annotations are being extended via phylogenetic- 
based approaches. Through identification of phylogenet- 
ically related orthologous, homologous and paralogous 
genes across species, the GO consortium is promoting 
coordinate annotations of these genes across organisms. 
MGD is actively participating in these gene annotations to 
enrich functional information about a highly curated 
set of phylogenetically related genes among species and 
to enable propagation of functional annotations between 
organisms (18,19). 

Retooling MGD infrastructure: a plan for the future 

MGD is in the process of a significant infrastructure 
migration project to move from the Sybase relational 
database management system to a more technically attract- 
ive open source database technology (PostgreSQL). Phase 
I of this project is to move and rewrite software on our 
public servers, specifically those components supporting 
the web interface and direct SQL accounts. As well, 
we are retooling the web interface software to use Solr 
and Lucene to handle most querying, Java Spring 
Model-View-Controller (MVC) for web page generation 
and YAHOO User Interface (YUI) for on-page interactiv- 
ity. Beyond the user benefits visible in the initial release, 
this technology migration will position us well for future 
developments. Phase II, to migrate and retool the software 
residing on our back end servers (where the data loading 
and curation occur) is also underway. 

New direct access methods for MGD 

MGD has always provided direct SQL access to a public 
Sybase server. As part of the migration described in the 
previous paragraph, the Sybase server has been retired, 
and a public PostgreSQL server is now available. 
In addition, for users who want MGD at their local 
sites, we now provide complete database dumps for both 
PostgreSQL and MySQL. The public SQL server and 
the database dumps are updated on a weekly basis. 
Dump files are available from our FTP site at 
ftp://ftp.informatics.jax.org/pub/database_backups/. Instructions 
can be found at http://www.informatics.jax.org/software.shtml. 
Contact MGI User Support (mgi-help@informatics.jax.org) 
to request a PostgreSQL account or 
for assistance in using the database dumps. Individuals 
interested in programmatic and bulk access may also 
want to join the MGI-Technical listserve 
(http://www.informatics.jax.org/mgihome/lists/lists.shtml) to receive 
technical updates about the database. 
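
To give a sense of what direct SQL access looks like, the following sketch runs a parameterized query that selects genome features by feature type. Everything schema-related here is an assumption: the table `genome_feature` and its columns are hypothetical and do not reflect the actual MGD schema, and Python's built-in sqlite3 module stands in for a PostgreSQL connection so that the example is self-contained. The rows are illustrative only.

```python
import sqlite3

# In-memory stand-in for a database connection. With the public
# PostgreSQL server or a local dump, a PostgreSQL driver would be
# used instead, and its placeholder syntax may differ from "?".
conn = sqlite3.connect(":memory:")

# Hypothetical table; NOT the real MGD schema.
conn.execute(
    "CREATE TABLE genome_feature (symbol TEXT PRIMARY KEY, feature_type TEXT)"
)
conn.executemany(
    "INSERT INTO genome_feature VALUES (?, ?)",
    [
        ("Areg", "protein coding gene"),
        ("Pax1", "protein coding gene"),
        ("Ren2", "protein coding gene"),
        ("Psme2b-ps", "pseudogene"),
    ],
)


def symbols_of_type(connection, feature_type):
    """Return the symbols of all features with the given feature type."""
    cursor = connection.execute(
        "SELECT symbol FROM genome_feature "
        "WHERE feature_type = ? ORDER BY symbol",
        (feature_type,),  # bound parameter, never string-formatted into SQL
    )
    return [row[0] for row in cursor.fetchall()]


print(symbols_of_type(conn, "pseudogene"))
print(symbols_of_type(conn, "protein coding gene"))
```

Table and column names for real queries should be taken from the schema documentation that accompanies the database dumps, not from this sketch.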

Figure 2. Screenshot of the results of querying for 'wavy' using the MGI Quick Search box. Note that heritable phenotypic markers that identify 
mutants whose underlying gene is not yet identified, such as Wf and Wtgr, are retrieved, as well as genes (e.g. Pax1, with synonym of wavy tail), 
and other types of mutant alleles in defined genes (e.g. Pax1<un-ex>, undulated extensive mutation of the Pax1 gene). 



OTHER INFORMATION 

Mouse gene, allele and strain nomenclature 

MGD is the international authoritative source of 
symbols and names for mouse genes, alleles and strains. 
MGD follows and implements the guidelines set by 
the International Committee on Standardized Genetic 
Nomenclature for Mice (http://www.informatics.jax.org/nomen). 
This official nomenclature is widely disseminated 
through regular data exchange and curation of shared 
links between MGI and other bioinformatics resources. 
MGD staff members work with editors of journal publi- 
cations and consortium projects to promote adherence 
to mouse nomenclature standards in publications and 
online data resources. 

To support consistency of nomenclature across species, 
MGD coordinates names and symbols for genes and 
genome features with nomenclature experts from the 
Human Gene Nomenclature Committee (HGNC) (20) 
(http://www.genenames.org/) and the Rat Genome 
Database (RGD) (21) (http://rgd.mcw.edu). The MGD 
nomenclature coordinator can be contacted by email 
(nomen@informatics.jax.org). 

Programmatic and bulk data access 

Portions of the database are accessible programmatically 
using web services and BioMart. The MGI web service 
accepts SOAP 1.1 and 1.2 requests. For details, see 
http://www.informatics.jax.org/mgihome/other/web_service.shtml. 
The MGD BioMart is accessible at 
http://biomart.informatics.jax.org. Additional information 
about MartServices can be found at 
http://www.biomart.org/martservice.html. 

MGI also provides bulk data sets through regularly 
updated FTP reports 
(ftp://ftp.informatics.jax.org/pub/reports/index.html) 
and via the MGI Batch Query tool 
(http://www.informatics.jax.org/javawi2/servlet/WIFetch?page=batchQF), 
where users can develop a customized 
bulk data set. 



Electronic data submission 

MGD accepts contributed data sets from individuals and 
organizations for any type of data maintained by the 
database. The most frequent types of contributed data 
are mutant and phenotypic allele information originating 
with the large mouse mutagenesis centers and strain 
data from repositories that contribute to the IMSR 
(http://www.findmice.org) (10). Each electronic submis- 
sion receives a permanent database accession ID. All 
data sets are associated with their source, either a publi- 
cation or an electronic submission reference. Details about 
data submission procedures can be found at 
http://www.informatics.jax.org/submit.shtml. 

Additions and corrections to the representation of data 
and information in MGD can be submitted using the 
'Your Input Welcome' link that appears in the upper 
right hand corner of gene and allele detail pages. 

Community outreach and User Support 

The MGD resource has full time staff members who 
are dedicated to user support and training. Members of 
the User Support team can be contacted via email, web 
requests, phone or fax. 

• World wide web: http://www.informatics.jax.org/mgihome/support/support.shtml 

• Email access: mgi-help@informatics.jax.org 

• Telephone access: 1 207 288 6445 

• Fax access: 1 207 288 6132 

MGD User Support staff are available for on-site training 
on the use of MGD and other MGI data resources. 
MGD's traveling tutorial program (roadshow) includes 
lectures, demos and hands-on tutorials, which can be 
customized according to the research interests of the 
audience. To inquire about sponsoring an MGD 
roadshow, send email to mgi-help@informatics.jax.org. 

On-line training materials for MGD and other MGI 
data resources are available as FAQs and on-demand 
help documents. 

Other outreach 

MGI-LIST (http://www.informatics.jax.org/mgihome/lists/lists.shtml) 
is a moderated and active email bulletin 
board supported by the MGD User Support group. The 
MGI listserve has over 2100 subscribers. On average, there 
are three posts per day, every day. The MGI-Technical 
listserve also has been instituted for technical information 
for software developers and bioinformaticians accessing 
MGI data, using APIs, and making links to MGI. 

HIGH LEVEL OVERVIEW OF THE MAIN 
COMPONENTS AND IMPLEMENTATION 

The MGD production database comprises approximately 
180 tables within which biological information is encoded. 
As we are transitioning between database engines, we cur- 
rently have instances in both Sybase and PostgreSQL. 
BLAST-able databases, genome assembly files for 
sequence data and images are stored outside the relational 
database. An editing interface and automated load 
programs are used to input data into the MGD system. 
Automated loads enter/update the bulk of data and 
associations in MGD. A typical load will load 'as much 
as it can' (typically, the large majority) and report the rest 
in various quality control reports. These are reviewed by 
curators, who may resolve problem cases by editing MGD 
and/or by communicating with data providers. The inter- 
active graphical editing interface provides curators with 
the ability to update the database, enter new data from 
the literature, track curation status, etc. 
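
The 'load as much as it can' pattern described above can be sketched as follows. This is a minimal illustration, not MGD's load software: the record fields (`allele_id`, `gene_symbol`), the validation rules and the report layout are all assumptions.

```python
from dataclasses import dataclass, field


@dataclass
class LoadResult:
    loaded: list = field(default_factory=list)      # records accepted
    qc_report: list = field(default_factory=list)   # (line number, problems)


def run_load(records, known_symbols):
    """Load every record that passes validation; report the rest.

    Hypothetical rules: a record needs a non-empty allele_id and a
    gene_symbol that is already known to the database.
    """
    result = LoadResult()
    for line_no, rec in enumerate(records, start=1):
        problems = []
        if not rec.get("allele_id"):
            problems.append("missing allele_id")
        if rec.get("gene_symbol") not in known_symbols:
            problems.append(f"unknown gene symbol: {rec.get('gene_symbol')!r}")
        if problems:
            # Do not abort the load; queue the record for curator review.
            result.qc_report.append((line_no, problems))
        else:
            result.loaded.append(rec)
    return result


records = [
    {"allele_id": "A1", "gene_symbol": "Areg"},
    {"allele_id": "", "gene_symbol": "Pax1"},
    {"allele_id": "A3", "gene_symbol": "Xyz"},
]
result = run_load(records, known_symbols={"Areg", "Pax1"})
print(len(result.loaded), "loaded")
for line_no, problems in result.qc_report:
    print(f"line {line_no}: {'; '.join(problems)}")
```

Collecting problems per record instead of raising on the first error is what lets a load proceed with the large majority of its input while still giving curators a complete list of cases to resolve.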

Public data access to MGD is provided primarily 
through the web interface where users can interactively 
query and download our data through a web browser. 



MouseBLAST allows users to do sequence similarity 
searches against a variety of rodent sequence databases 
that are updated weekly from selected sequence databases 
from NCBI, UniProt and other providers. Mouse 
GBrowse allows users to visualize mouse data sets 
against the genome as a series of linear tracks. All 
MGD files and programs are openly and freely available. 

We continue to provide MGD BioMart with the 
addition of new classification terms for genome features. 
MGD BioMart supports chaining to several other 
BioMarts including Ensembl, VEGA and RGD. 
Additional functionalities such as the ability to filter by 
GO, MP Ontology and OMIM terms, and including 
additional information about alleles, are planned for 
future extensions. MGD BioMart is updated on a 
weekly basis. 

CITING MGD 

For a general citation of the MGI resource please cite this 
article. In addition, the following citation format is 
suggested when referring to data sets specific to the 
MGD component of MGI: MGD, MGI, The Jackson 
Laboratory, Bar Harbor, Maine (URL: 
http://www.informatics.jax.org). [Type in date (month, year) when 
you retrieved the data cited.]. 
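
For those who generate citations programmatically, the suggested format can be filled in with a small helper. The function name is hypothetical, and the rendering of the date as "Month, Year" inside the brackets is an assumption about how the "(month, year)" instruction is meant to be applied.

```python
from datetime import date

# Month names are spelled out explicitly so the output does not depend
# on the locale of the machine running the code.
MONTHS = (
    "January", "February", "March", "April", "May", "June",
    "July", "August", "September", "October", "November", "December",
)


def mgd_citation(retrieved: date) -> str:
    """Return the suggested MGD citation with the retrieval date filled in."""
    when = f"{MONTHS[retrieved.month - 1]}, {retrieved.year}"
    return (
        "MGD, MGI, The Jackson Laboratory, Bar Harbor, Maine "
        "(URL: http://www.informatics.jax.org). "
        f"[{when}]."
    )


print(mgd_citation(date(2011, 9, 14)))
```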